-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pkg/expect: fix data race #16173
pkg/expect: fix data race #16173
Conversation
wow, great catch. I wonder if we can simplify the whole expect a little by serialization? I'm not entirely sure whether we have many cases that spawn a longer running task and require an "early" expect result for test/runtime reasons. TestGrpcProxyAutoSync seems to be one such candidate. |
I was thinking that case. But there are some cases to expect process is ready, like etcd/tests/framework/e2e/util.go Lines 31 to 42 in ac58abc
|
indeed, that's also what I found browsing for it. Alternatively we can also check the readiness probe instead? Anyway, I think that's a bigger change overall. |
Agree. It's reasonable to use expect pkg for short-live process only. It needs ETCD server to expose more metrics in API to replace checking the logging. Maybe we can track this in issue for enhancement. |
pkg/expect/expect.go
Outdated
// NOTE: The ep.read() goroutine needs the mutex so we have to drain the | ||
// log without lock. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we have to drain the log without lock.
Could you describe where we drain the log without lock? I thought ep.read() always drains the log with the mutex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the unclear comment.
Lines 102 to 138 in 4364e75
func (ep *ExpectProcess) read() { | |
defer ep.wg.Done() | |
defer func(fpty *os.File) { | |
err := fpty.Close() | |
if err != nil { | |
// we deliberately only log the error here, closing the PTY should mostly be (expected) broken pipes | |
fmt.Printf("error while closing fpty: %v", err) | |
} | |
}(ep.fpty) | |
r := bufio.NewReader(ep.fpty) | |
for { | |
err := ep.tryReadNextLine(r) | |
if err != nil { | |
break | |
} | |
} | |
} | |
func (ep *ExpectProcess) tryReadNextLine(r *bufio.Reader) error { | |
printDebugLines := os.Getenv("EXPECT_DEBUG") != "" | |
l, err := r.ReadString('\n') | |
ep.mu.Lock() | |
defer ep.mu.Unlock() | |
if l != "" { | |
if printDebugLines { | |
fmt.Printf("%s (%s) (%d): %s", ep.cmd.Path, ep.cfg.name, ep.cmd.Process.Pid, l) | |
} | |
ep.lines = append(ep.lines, l) | |
ep.count++ | |
} | |
// we're checking the error here at the bottom to ensure any leftover reads are still taken into account | |
return err | |
} |
If ep.read()
just reads last line from pty, it will try to get lock and then append the log into ep.lines
. So, if the ep.Expect
waits the ep.readCloseCh
channel close with the lock, the ep.read()
doesn't have a chance to get the lock and then the ep.readCloseCh
won't be closed until ep.Expect
releases the lock. It's deadlock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification!
// NOTE: The ep.read() goroutine needs the mutex so we have to drain the
// log without lock.
I guess ep.ExpectFunc
does not drain the log which causes my confusion.
How about updating to?
// NOTE: we wait readCloseCh for ep.read() to complete draining the log before acquring the lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with one nit, Thanks @fuweid!
pkg/expect/expect.go
Outdated
// NOTE: The ep.read() goroutine needs the mutex so we have to drain the | ||
// log without lock. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification!
// NOTE: The ep.read() goroutine needs the mutex so we have to drain the
// log without lock.
I guess ep.ExpectFunc
does not drain the log which causes my confusion.
How about updating to?
// NOTE: we wait readCloseCh for ep.read() to complete draining the log before acquring the lock.
line := ep.lines[i] | ||
i++ | ||
if f(line) { | ||
return line, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think any consumer of ExpectFunc
relies on the exit error, so return nil should be fine.
abc05ee
to
cfc8a57
Compare
Thanks @chaochn47 ! updated |
ping @ahrtr @serathius |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a very minor comment.
Thanks @fuweid
pkg/expect/expect.go
Outdated
@@ -69,6 +71,7 @@ func NewExpectWithEnv(name string, args []string, env []string, serverProcessCon | |||
}, | |||
} | |||
ep.cmd = commandFromConfig(ep.cfg) | |||
ep.readCloseCh = make(chan struct{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comment, we can just put this line into the struct initialization,
ep = &ExpectProcess{
cfg: expectConfig{
name: serverProcessConfigName,
cmd: name,
args: args,
env: env,
},
readCloseCh: make(chan struct{}),
}
Let's say there is command which outputs one line and exit with error. There are three goroutines to acquire the lock: 1. ep.read() 2. ep.waitSaveExitErr() 3. ep.Expect() When ep.read goroutine reads the log but it doesn't acquire the lock in time, the ep.waitSaveExitErr acquires the lock and updates the `exitErr`. And then ep.Expect acquires lock but it doesn't see any log yet and then returns err. It's hard to reproduce it in local. Add the extra sleep can reproduce it. ```diff diff --git a/pkg/expect/expect.go b/pkg/expect/expect.go index a512a3ce4..602bea73f 100644 --- a/pkg/expect/expect.go +++ b/pkg/expect/expect.go @@ -128,6 +128,7 @@ func (ep *ExpectProcess) tryReadNextLine(r *bufio.Reader) error { printDebugLines := os.Getenv("EXPECT_DEBUG") != "" l, err := r.ReadString('\n') + time.Sleep(10 * time.Millisecond) ep.mu.Lock() defer ep.mu.Unlock() ``` See it once in Github Action [1]. In order to fix it, the patch introduces `readCloseCh` to wait for ep.read to get all the data and retry it. [1]: etcd-io#16137 (comment) Signed-off-by: Wei Fu <[email protected]>
cfc8a57
to
56edfa6
Compare
Let's say there is command which outputs one line and exit with error.
There are three goroutines to acquire the lock:
When ep.read goroutine reads the log but it doesn't acquire the lock in time, the ep.waitSaveExitErr acquires the lock and updates the
exitErr
. And then ep.Expect acquires lock but it doesn't see any log yet and then returns err.It's hard to reproduce it in local. Add the extra sleep can reproduce it.
See it once in Github Action [1]. In order to fix it, the patch introduces
readCloseCh
to wait for ep.read to get all the data and retry it.Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.